Notes: 用户低估来信息接受人数 ***
Notes:对于连续的变量,要探究其相关性时,散点图是最好的选择
pf <- read.csv("../lesson3/pseudo_facebook.tsv", sep="\t")
library(ggplot2)
qplot(x=age, y=friend_count, data=pf)
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_point()
Response: 多点过于集中,年龄可能存在虚假报告 ***
Notes:
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_point(alpha=0.05) +
xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
Notes:因为点重叠无法真是反应出年龄和朋友数量的关系(因为年龄是连续性书之后),可以使用jitter方法来人为增加noise,以增强显示点。下图就是使用jitter后展现效果
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_jitter(alpha=0.05) +
xlim(13, 90)
## Warning: Removed 5159 rows containing missing values (geom_point).
Response:在60岁左右的用户的朋友数具有明显的数量多;即使是年轻人朋友数量也主要集中在2000人以下
Notes:[coor_trans文档](http://ggplot2.tidyverse.org/reference/coord_trans.html) is different to scale transformations in that it occurs after statistical transformation and will affect the visual appearance of geoms - there is no guarantee that straight lines will continue to be straight.
示例: coord_trans(x = “identity”, y = “identity”, limx = NULL, limy = NULL, xtrans, ytrans)
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_point(alpha=0.05) +
xlim(13, 90)+
coord_trans(y="sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).
在使用jitter是需要注意,如果直接使用jitter的话,可能因为点在0上使用sqrt之后而出现虚数,为了避免该情况,使用position参数来指定针对age进行抖动。下图是使用position之后的图,可以看出没有出现y轴0以下的数。关于position_jitter解释
ggplot(aes(x=age, y=friend_count), data=pf) +
geom_point(alpha=0.05, position=position_jitter(h=0)) +
xlim(13, 90)+
coord_trans(y="sqrt")
## Warning: Removed 5184 rows containing missing values (geom_point).
Notes:
ggplot(aes(x=age, y=friendships_initiated), data=pf) +
geom_jitter(alpha=0.05) +
xlim(13, 90)
## Warning: Removed 5193 rows containing missing values (geom_point).
ggplot(aes(x=age, y=friendships_initiated), data=pf) +
geom_point(alpha=0.05, position=position_jitter(h=0)) +
coord_trans(y="sqrt")+
xlim(13, 90)
## Warning: Removed 5174 rows containing missing values (geom_point).
Notes:
Notes:使用dplyr库,进行相关处理,注意函数n不需要使用参数——它是直接用于结算数量的
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pf.fc_by_age <- pf %>% group_by(age) %>%
summarise(mean=mean(friend_count), median=median(friend_count), n=n()) %>%
arrange(age)
pf.fc_by_age
## # A tibble: 101 x 4
## age mean median n
## <int> <dbl> <dbl> <int>
## 1 13 164.7500 74.0 484
## 2 14 251.3901 132.0 1925
## 3 15 347.6921 161.0 2618
## 4 16 351.9371 171.5 3086
## 5 17 350.3006 156.0 3283
## 6 18 331.1663 162.0 5196
## 7 19 333.6921 157.0 4391
## 8 20 283.4991 135.0 3769
## 9 21 235.9412 121.0 3671
## 10 22 211.3948 106.0 3032
## # ... with 91 more rows
Create your plot!
ggplot(aes(x=age, y=mean), data=pf.fc_by_age) +
geom_line() +
scale_x_continuous(limits=c(13,90))
## Warning: Removed 23 rows containing missing values (geom_path).
Notes:如果需要在图形上添加摘要信息,可以使用相关的图形,并且加入相关的统计信息参数,如geom_line(stat=“summary”, fun.y=mean)——其中stat表示统计方法,fun.y针对的是使用统计的函数
ggplot(aes(age, friend_count), data=pf) +
coord_cartesian(xlim=c(13,90), ylim=c(0,2000)) +
geom_point(alpha=0.05, position=position_jitter(h=0), color="orange") +
coord_trans(y="sqrt") +
geom_line(stat="summary", fun.y=mean) +
geom_line(stat="summary", fun.y=quantile, fun.args=list(probs=.1), linetype=2, color="blue") +
geom_line(stat="summary", fun.y=quantile, fun.args=list(probs=.9), linetype=2, color="blue") +
geom_line(stat="summary", fun.y=quantile, fun.args=list(probs=.5), color="blue")
Response: 年龄越大朋友数量反而会表现增大的趋势,所有的朋友数都集中在1000以下 ***
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
Notes:
Notes:
with(pf, cor(age, friend_count))
## [1] -0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response:
Notes: with方法的使用是不仅减少了数据输入,而且加快运行速度
with(subset(pf, age<=70), cor.test(age, friend_count))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
Notes:
with(subset(pf, age<=70), cor.test(age, friend_count, method="spearman"))
## Warning in cor.test.default(age, friend_count, method = "spearman"): Cannot
## compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: age and friend_count
## S = 1.5782e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.2552934
Notes:
ggplot(aes(x=www_likes_received, y=likes_received), data=pf) +
geom_point(alpha=.05) +
coord_cartesian(ylim=c(0,10000), xlim=c(0, 6000))
Notes:
ggplot(aes(x=www_likes_received, y=likes_received), data=pf) +
geom_point() +
xlim(0, quantile(pf$www_likes_received, .95)) +
ylim(0, quantile(pf$likes_received, 0.95)) +
geom_smooth(method="lm", color="red")
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).
What’s the correlation betwen the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
with(pf, cor.test(www_likes_received, likes_received), method="pearson")
##
## Pearson's product-moment correlation
##
## data: www_likes_received and likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Response:
Notes: 相关性对于探究多变量之间关系具有重要且直观的意义,可以推动是否需要继续研究两者之间关系。但是需要注意,直接使用相关性探究可能得到的数值是具有强相关性,但是在推荐使用scatter plot来继续查看是否具有某种相关性。 ***
Notes:
library(alr3)
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
Create your plot!
ggplot(aes(Month, Temp), data=Mitchell) +
geom_point()
with(Mitchell, cor.test(Month, Temp))
##
## Pearson's product-moment correlation
##
## data: Month and Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Take a guess for the correlation coefficient for the scatterplot.
What is the actual correlation of the two variables? (Round to the thousandths place)
Notes:
ggplot(aes(Month, Temp), data=Mitchell) +
geom_point() +
scale_x_continuous(breaks=seq(0, max(Mitchell$Month), 11)) +
geom_path()
What do you notice? Response:从时间周期来看,一个人的年龄还可以延伸到多少岁几个月——因为具有时间周期性,所以接下来将使用增加月份来探究
Watch the solution video and check out the Instructor Notes! Notes:因为年龄里面的线不是平滑的,猜测具有噪点,因此通过对年龄调整来把年龄数据细化在探索
Notes:
pf$age_with_month <- with(pf, age + (1 - dob_month/12))
pf.fc_by_age_months <- pf %>%
group_by(age_with_month) %>%
summarise(mean=mean(friend_count), median=median(friend_count), n=n()) %>%
arrange(age_with_month)
head(pf.fc_by_age_months)
## # A tibble: 6 x 4
## age_with_month mean median n
## <dbl> <dbl> <dbl> <int>
## 1 13.16667 46.33333 30.5 6
## 2 13.25000 115.07143 23.5 14
## 3 13.33333 136.20000 44.0 25
## 4 13.41667 164.24242 72.0 33
## 5 13.50000 131.17778 66.0 45
## 6 13.58333 156.81481 64.0 54
Programming Assignment
ggplot(aes(age_with_month, mean), data=pf.fc_by_age_months) +
geom_line() +
xlim(13, 70) +
ylim(0, 500)
## Warning: Removed 511 rows containing missing values (geom_path).
针对
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
p1 <- ggplot(aes(x=age, y=mean), data=subset(pf.fc_by_age, age<71)) +
geom_line() #只探索0-70岁的用户
p2 <- ggplot(aes(age_with_month, mean), data=subset(pf.fc_by_age_months, age_with_month<71)) +
geom_line() #增加噪点研究年龄上的平均朋友数变化
grid.arrange(p2, p1, ncol=1)
减小年龄间的范围,参考方法是使用局部回归思路,直观图解见文档 ### Smoothing Conditional Means Notes:
library(gridExtra)
p1 <- ggplot(aes(age, mean), data=subset(pf.fc_by_age, age<71)) +
geom_line() #只探索0-70岁的用户
p2 <- ggplot(aes(age_with_month, mean), data=subset(pf.fc_by_age_months, age_with_month<71)) +
geom_line() #增加噪点研究年龄上的平均朋友数变化
p3 <- ggplot(aes(round(age/5)*5, friend_count), data=subset(pf, age<71))+
geom_line(stat="summary", fun.y=mean) #通过更改用户年龄来改变绘图是用到的年龄段,减少noise——将5岁年龄内的数据归为一类
grid.arrange(p2, p1, p3, ncol=1)
library(gridExtra)
p1 <- ggplot(aes(age, mean), data=subset(pf.fc_by_age, age<71)) +
geom_line() +
geom_smooth() #只探索0-70岁的用户
p2 <- ggplot(aes(age_with_month, mean), data=subset(pf.fc_by_age_months, age_with_month<71)) +
geom_line() +
geom_smooth() #增加噪点研究年龄上的平均朋友数变化
p3 <- ggplot(aes(round(age/5)*5, friend_count), data=subset(pf, age<71))+
geom_line(stat="summary", fun.y=mean) #通过更改用户年龄来改变绘图是用到的年龄段,减少noise——将5岁年龄内的数据归为一类
grid.arrange(p2, p1, p3, ncol=1)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
从上图看出,在第二种方案中更容易产生系统误差 ***
Notes: 数据折衷是多种方案的选择,不适单独某个方案的能够确定关系。所以数据分析时,最佳方案时选择多个方案进行分析 ***
Reflection:散点图,相关系数,模型判断,jitter,多方案折衷
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!